Characterising Emergent Semantics in Twitter Lists
Authors
Abstract
More general synsets appear at the top of the hierarchy, while more specific ones are placed at the bottom. Thus, Wu and Palmer [26] propose a similarity measure which includes the depth of the synsets and of the least common subsumer (see equation 1). The least common subsumer lcs is the deepest hypernym that subsumes both synsets, and depth is the length of the path from the root to the synset. This similarity ranges between 0 and 1; the larger the value, the greater the similarity between the terms. For the terms measure and communication, both synsets have depth 4, and the depth of the lcs abstraction is 3; therefore, their similarity is 0.75.

wp(synset1, synset2) = 2 * depth(lcs) / (depth(synset1) + depth(synset2))   (1)

Jiang and Conrath [16] propose a distance measure that combines hierarchical and distributional information. Their formula includes features such as local network density (i.e., children per synset), synset depth, weight according to the link type, and the information content IC of the synsets and of the least common subsumer. The information content of a synset is calculated as the inverse log of its probability of occurrence in the WordNet hierarchy. This probability is based on the frequency of the words subsumed by the synset. As the probability of a synset increases, its information content decreases. The Jiang and Conrath distance can be computed using equation 2 when only the information content is used. A shorter distance means a stronger semantic relation. The IC of measure and communication is 2.95 and 3.07 respectively, while abstraction has an IC of 0.78; thus their semantic distance is 4.46.

jc(synset1, synset2) = IC(synset1) + IC(synset2) - 2 * IC(lcs)   (2)

We use, in section 4, the path length, Wu and Palmer similarity, and Jiang and Conrath distance to study the semantics of the relations extracted from Twitter lists using the vector space model and LDA.
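The two measures can be reproduced with a few lines of arithmetic. The sketch below plugs in the depth and IC values quoted in the text for measure, communication, and their least common subsumer abstraction; the values are taken from the worked example, not computed from WordNet.

```python
def wu_palmer(depth1, depth2, depth_lcs):
    """Wu-Palmer similarity (equation 1): between 0 and 1, higher means more similar."""
    return 2 * depth_lcs / (depth1 + depth2)

def jiang_conrath(ic1, ic2, ic_lcs):
    """Jiang-Conrath distance (equation 2): lower means a stronger semantic relation."""
    return ic1 + ic2 - 2 * ic_lcs

# Example from the text: measure vs. communication, lcs = abstraction
print(wu_palmer(4, 4, 3))                         # 0.75
print(round(jiang_conrath(2.95, 3.07, 0.78), 2))  # 4.46
```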
3.2 Linked Data to Identify Relation Types

WordNet-based analysis is rather limited, since WordNet contains a small number of relations between synsets. To overcome this limitation and improve the detection of relationships, we use general purpose knowledge bases such as DBpedia [4], OpenCyc^2, and UMBEL^3, which provide a wealth of well-defined relations between concepts and instances. DBpedia contains knowledge from Wikipedia for close to 3.5 million resources and more than 600 relations. OpenCyc is a general purpose knowledge base with nearly 500K concepts and around 15K types of relations. UMBEL is an ontology with 28,000 concepts and 38 relations. These knowledge bases are published as linked data [3] in RDF and with links between them: DBpedia resources and classes are connected to OpenCyc concepts using owl:sameAs, and to UMBEL concepts using umbel#correspondsTo. Our aim is to bind keywords extracted from list names to semantic resources in these knowledge bases so that we can identify which kinds of relations appear between them. To do so we harness the high degree of interconnection in the linked data cloud offered by DBpedia. We first ground keywords to DBpedia [12], and then we browse the linked data set for relations connecting the keywords. After connecting keywords to DBpedia resources, we query the linked data set to search for relations between pairs of resources. We use a similar approach to [14], where SPARQL queries are used to search for relations linking two resources rs and rt. We define the path length L as the number of objects found in the path linking rs with rt. For L = 2 we look for a relation1 linking rs with rt. As we do not know the direction of relation1, we search in both directions: 1) rs relation1 rt, and 2) rt relation1 rs. For L = 3 we look for a path containing two relationships and an intermediate resource node, such as: rs relation1 node, and node relation2 rt.
Note that each relationship may have two directions, and hence the number of possible paths is 2^2 = 4. For L = 4 we have three relationship placeholders and the number of possible paths is 2^3 = 8. In general, for a path length L we have n = Σ_{l=2}^{L} 2^(l-1) possible paths that can be traversed by issuing the same number of SPARQL queries on the linked data set.^4

For instance, let us find the relation between the keywords Anthropology and Sociology. First both keywords are grounded to the respective DBpedia resources, in this case dbpr:Anthropology and dbpr:Sociology. Figure 2 shows linked data relating these DBpedia resources. To retrieve this information, we pose the query shown in Listing 1.1.^5 The result is the triples making up the path between the resources. In our case we discard the initial owl:sameAs relation between DBpedia and OpenCyc resources, and keep the assertion that Anthropology and Sociology are Social Sciences.

Fig. 2. Linked data showing the relation between anthropology and sociology: the keywords are grounded to dbpr:Anthropology and dbpr:Sociology, which are linked via owl:sameAs to opencyc:anthropology and opencyc:sociology, both of which have rdf:type opencyc:social science.

SELECT * WHERE {
  dbpr:Anthropology ?relation1 ?node1 .
  ?node1 ?relation2 ?node2 .
  dbpr:Sociology ?relation4 ?node3 .
  ?node3 ?relation3 ?node2 . }

Listing 1.1. SPARQL query for finding relations between two DBpedia resources

^2 OpenCyc home page: http://sw.opencyc.org/
^3 UMBEL home page: http://www.umbel.org/
^4 Note that for large L values the queries can take a long time on large data sets.
^5 Property paths, in the SPARQL 1.1 specification, allow simplifying these queries.

4 Experiment Description

Data Set: Twitter offers an Application Programming Interface (API) for data collection. We collected a snowball sample of users and lists as follows. Starting with two initial seed users, we collected all the lists they subscribed to or are members of.
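The path enumeration described above can be sketched as a small query generator: for a path of length L there are L - 1 relation placeholders, and flipping the direction of each placeholder yields the 2^(L-1) graph patterns for that length. The helper name path_queries and the variable naming are ours, not from the paper.

```python
from itertools import product

def path_queries(rs, rt, L):
    """Generate SPARQL patterns for every direction assignment of the
    L-1 relation placeholders on a path of length L between rs and rt
    (2^(L-1) patterns in total)."""
    nodes = [rs] + [f"?node{i}" for i in range(1, L - 1)] + [rt]
    queries = []
    for directions in product((0, 1), repeat=L - 1):
        triples = []
        for i, d in enumerate(directions):
            s, o = nodes[i], nodes[i + 1]
            if d:  # reversed direction of this relation
                s, o = o, s
            triples.append(f"{s} ?relation{i + 1} {o} .")
        queries.append("SELECT * WHERE { " + " ".join(triples) + " }")
    return queries

# L = 3: two placeholders, hence 2^2 = 4 patterns
for q in path_queries("dbpr:Anthropology", "dbpr:Sociology", 3):
    print(q)
```

Issuing each generated pattern against the SPARQL endpoint reproduces the exhaustive direction search; in practice, SPARQL 1.1 property paths can replace much of this enumeration.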
There were 260 such lists. Next, we expanded the user layer based on the current lists by collecting all other users who are members of or subscribers to these lists. This yielded an additional set of 2573 users. In the next iteration, we expanded the list layer by collecting all lists that these users subscribe to or are members of. In the last step, we collected 297,521 lists under which 2,171,140 users were classified. The lists were created by 215,599 distinct curators, and 616,662 users subscribe to them.^6 From the list names we extracted, by approximate matching of the names with dictionary entries, 5932 unique keywords; 55% of them were found in WordNet. The dictionary was created from article titles and redirection pages in Wikipedia.

Obtaining Relations from Lists: For each keyword we created the vectors and the bags of words for each of the three user-based representations defined in section 2. We calculated the cosine similarity in the corresponding user-based vector space. We also ran the LDA algorithm over the bags of words and calculated the cosine similarity between the topic distributions produced for each document. We kept the 5 most similar terms for each keyword according to the vector-space and LDA-based similarities. The data set can be found here: http://goo.gl/vCYyD
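The similarity step can be sketched as follows. The sparse-vector representation and the toy data are illustrative only; the paper's actual vectors are the user-based representations defined in its section 2.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dict: dimension -> weight)."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def most_similar(target, vectors, k=5):
    """Keep the k keywords most similar to `target` (the paper keeps 5)."""
    others = [(kw, cosine(vectors[target], vec))
              for kw, vec in vectors.items() if kw != target]
    return [kw for kw, _ in sorted(others, key=lambda p: -p[1])[:k]]

# Toy user-based vectors: keyword -> {user: weight}
vectors = {
    "anthropology": {"u1": 1, "u2": 1, "u3": 1},
    "sociology":    {"u1": 1, "u2": 1},
    "football":     {"u4": 1},
}
print(most_similar("anthropology", vectors, k=2))  # ['sociology', 'football']
```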
Similar Resources
Automatic Stopword Generation using Contextual Semantics for Sentiment Analysis of Twitter
In this paper we propose a semantic approach to automatically identify and remove stopwords from Twitter data. Unlike most existing approaches, which rely on outdated and context-insensitive stopword lists, our proposed approach considers the contextual semantics and sentiment of words in order to measure their discrimination power. Evaluation results on 6 Twitter datasets show that removing o...
iPLUG: Personalized List Recommendation in Twitter
A Twitter user can easily be overwhelmed by flooding tweets from her followees, making it challenging for the user to find interesting and useful information in tweets. The feature of Twitter Lists allows users to organize their followees into multiple subsets for selectively digesting tweets. However, this feature has not received wide reception because users are reluctant to invest initial ef...
Fault-tolerant Emergent Semantics in P2P Networks
To survive in the twenty-first century, enterprises need to collaborate. Collaboration at the enterprise-level presupposes the interoperability of the underlying information systems. Access to heterogeneous information sources must be provided transparently while maintaining their autonomy. Further, the availability of nearly unlimited information calls for efficient and precise information ret...
Covering the Egonet: A Crowdsourcing Approach to Social Circle Discovery on Twitter
Twitter and other social media provide the functionality of manually grouping users into lists. The goal is to enable selective viewing of content and easier information acquisition. However, creating lists manually requires significant time and effort. To mitigate this effort, a number of recent methods attempt to create lists automatically using content and/or network structure, but results a...
On Refining Twitter Lists as Ground Truth Data for Multi-community User Classification
To help scholars and businesses understand and analyse Twitter users, it is useful to have classifiers that can identify the communities that a given user belongs to, e.g. business or politics. Obtaining high quality training data is an important step towards producing an effective multi-community classifier. An efficient approach for creating such ground truth data is to extract users from exi...
Supporting the Curation of Twitter User Lists
Twitter introduced lists in late 2009 as a means of curating tweets into meaningful themes. Lists were quickly adopted by media companies as a means of organising content around news stories. The curation of these lists is therefore important: they should contain the key information gatekeepers and present a balanced perspective on the story. Identifying members to add to a list on an emerging topic...